Data Scientist Capstone

By Marwan Saeed Alsharabbi

COVID-19 Analysis, Visualization & Prediction

Definition

Project Overview

Coronaviruses are a family of viruses named after their spiky crown. The novel coronavirus, also known as SARS-CoV-2, is a contagious respiratory virus that was first reported in Wuhan, China. On 11 February 2020, the World Health Organization designated the name COVID-19 for the disease caused by the novel coronavirus. This notebook aims to explore COVID-19 through data analysis and projections. The world is going through a difficult time, fighting a deadly virus. Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in December 2019 in Wuhan, China, and has resulted in an ongoing pandemic. The first case may be traced back to 17 November 2019. As of 8 June 2020, more than 7.06 million cases had been reported across 188 countries and territories, resulting in more than 403,000 deaths. More than 3.16 million people had recovered.

Select a real-world dataset

I chose the COVID-19 dataset from Our World in Data (https://ourworldindata.org/coronavirus). I will analyze and clean the data and draw some conclusions from it, with a focus on the analysis and cleaning of the global data. The data was downloaded from https://covid.ourworldindata.org/data/owid-covid-data.csv.

Data Sources:

Confirmed cases and deaths: Data comes from the European Centre for Disease Prevention and Control (ECDC).

Testing for COVID-19: Data is collected by the Our World in Data team from official reports; you can find the source information for every country and further details in the post on COVID-19 testing. The testing dataset is updated around twice a week.

Other variables: Data is collected from a variety of sources (United Nations, World Bank, Global Burden of Disease, etc.).

License:

The information on this page is summarized from OWID's COVID-19 GitHub page. All of Our World in Data is completely open access, and all work is licensed under the Creative Commons BY license. More information about the usage of the content can be found on the OWID GitHub page: https://github.com/owid/covid-19-data/tree/master/public/data

Authors:

According to OWID's COVID-19 GitHub page, the data has been collected, aggregated, and documented by Diana Beltekian, Daniel Gavrilov, Joe Hasell, Bobbie Macdonald, Edouard Mathieu, Esteban Ortiz-Ospina, Hannah Ritchie and Max Roser.

Problem Statement

1- Create a Linear Regression-Forecast model:

I created a linear regression model, fit it to the OWID COVID-19 data, and predicted the worldwide death projection for the next 30 days. I used sklearn to build the Linear Regression model with an 80/20 train/test split, then trained the model and predicted deaths for the next 30 days. I also created a model using XGBoost to improve on the linear regression model, fit it to the same OWID COVID-19 data, and predicted the worldwide death projection for the next 30 days.

2- Create a model that can predict the risk of mortality for people aged 65 and over using KNN:

Can I create a model that predicts the risk level for the Case Mortality Ratio of a country using its life expectancy, the percentage of its population over 65, its diabetes_prevalence, and its cardiovasc_death_rate?

I decided to use population over age 65, cardiovasc_death_rate, diabetes and other features because, worldwide, over 80% of the deaths were in the population aged 65 and over, and the CDC has stated that 94% of deaths involved some underlying health condition. I also used life expectancy per country to account for possible deficiencies in the health care system. Johns Hopkins University has listed several diseases, such as heart disease and diabetes, which are known to be exacerbated by obesity. The idea is that we can predict the Mortality Ratio of COVID-19 more accurately by using both the population aged 65 and over and obesity, rather than the population aged 65 and over alone. This may show that creating a healthier population is the best way to prevent, in future pandemics, the devastation that the world is currently facing.

Metrics and Methodology

Linear regression-Forecast model

k-nearest neighbors (KNN) algorithm

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
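The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration: the function name knn_predict, the toy points and the labels are made up for the example, and the project itself uses a library implementation rather than this code.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training instance.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest instances and count their class labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two features per point, two classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["low", "low", "low", "high", "high"]

prediction = knn_predict(points, labels, query=(1.1, 1.0), k=3)
print(prediction)
```

Nothing is learned ahead of time here: the "model" is just the stored points, which is why prediction cost grows with the size of the training data.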

Importing Libraries

Analysis

Data Exploration

Data from the file is read and stored in a DataFrame object - one of the core data structures in Pandas for storing and working with tabular data. We typically use the _df suffix in the variable names for dataframes.
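As a rough sketch of this step, the snippet below reads CSV text into a DataFrame with pd.read_csv. To stay self-contained it parses a tiny made-up sample held in memory; the name covid_df and the sample rows are assumptions for illustration, and with the real data one would pass the path of the downloaded owid-covid-data.csv instead.

```python
import io

import pandas as pd

# Tiny made-up sample in the OWID column layout (not real figures).
sample_csv = io.StringIO(
    "iso_code,continent,location,date,total_cases,total_deaths\n"
    "AFG,Asia,Afghanistan,2020-03-01,1,\n"
    "AFG,Asia,Afghanistan,2020-03-02,1,\n"
    "ITA,Europe,Italy,2020-03-01,1694,34\n"
)

# With the downloaded file, pass its path here instead of sample_csv.
covid_df = pd.read_csv(sample_csv)

print(type(covid_df))
print(covid_df.shape)       # (rows, columns)
print(covid_df.isna().sum())  # empty CSV fields are read as NaN
```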

Step 2: Perform data preparation & Cleaning

For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:

It is not really logical to delete rows with NaN values, so we replace NaN with 0 instead. Because the data is a historical time series, we cannot delete or overwrite most of the data in the rows without losing that history.
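A minimal sketch of the zero-fill approach, assuming a DataFrame named covid_df with a few OWID-style columns (the values below are made up). It fills only the numeric columns and keeps every row, so no day is dropped from the series.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the real data has many more columns.
covid_df = pd.DataFrame({
    "location": ["Italy", "Italy", "Italy"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "new_cases": [566.0, np.nan, 466.0],
    "new_tests": [np.nan, np.nan, 2466.0],
})

# Fill missing values in the numeric columns only; no row is removed.
numeric_cols = covid_df.select_dtypes(include="number").columns
clean_df = covid_df.copy()
clean_df[numeric_cols] = clean_df[numeric_cols].fillna(0)

print(clean_df)
```

One caveat of this choice is that a day with no report becomes indistinguishable from a day with a reported value of 0, which is worth keeping in mind when reading daily plots.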

Numerical Features

I'd rather copy from the list than from Pandas Profiling

Categorical Features

Data Visualization

Load the cleaned data and explore what the histograms of the data look like.

It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the .describe method.
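For illustration, here is what .describe returns on a small made-up frame (the name covid_df and the values are assumptions). Note that the count row reports non-empty values, so a column with missing entries shows a smaller count.

```python
import pandas as pd

# Made-up values; one missing entry in new_cases.
covid_df = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "new_cases": [10.0, 20.0, None, 30.0],
    "new_deaths": [1.0, 2.0, 3.0, 4.0],
})

# By default only the numeric columns are summarized.
summary = covid_df.describe()
print(summary)
```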

Questions

Data Understanding

While we have looked at overall numbers for cases, tests, positive rate, etc., it would also be useful to study these numbers on a month-by-month basis. The date column might come in handy here, as Pandas provides many utilities for working with dates.

You can see that it now has the datatype datetime64. We can now extract different parts of the date into separate columns, using the DatetimeIndex class.
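The conversion and extraction might look like the sketch below. The frame and its values are made up, and the new column names (year, month, day, weekday) are choices made for this example.

```python
import pandas as pd

covid_df = pd.DataFrame({
    "date": ["2020-12-30", "2020-12-31", "2021-01-01"],
    "new_cases": [100, 120, 90],  # made-up values
})

# Convert the text column to real datetimes.
covid_df["date"] = pd.to_datetime(covid_df["date"])
print(covid_df["date"].dtype)

# Pull the individual date parts out into their own columns.
date_index = pd.DatetimeIndex(covid_df["date"])
covid_df["year"] = date_index.year
covid_df["month"] = date_index.month
covid_df["day"] = date_index.day
covid_df["weekday"] = date_index.weekday  # Monday = 0

# Month-by-month totals become a simple groupby.
monthly_cases = covid_df.groupby(["year", "month"])["new_cases"].sum()
print(monthly_cases)
```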

Question#1

What is the total population of each location, by continent, in our dataset?

Question#2

The top 10 locations by total population in each continent in our dataset

The top 10 locations by total population in the continent 'Africa' in our dataset

Create a data frame showing the total population of each continent
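One way to build these population tables is sketched below, with made-up numbers. It assumes the OWID layout of one row per location per day, in which population repeats on every daily row, so locations are de-duplicated before summing. The names locations_df, continent_population_df and top_africa are chosen for this example.

```python
import pandas as pd

covid_df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Europe", "Europe"],
    "location": ["Nigeria", "Nigeria", "Egypt", "Italy", "Italy"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-01", "2021-01-02"],
    "population": [200, 200, 100, 60, 60],  # made-up numbers
})

# Keep one row per location so population is not counted once per day.
locations_df = covid_df.drop_duplicates(subset="location")[
    ["continent", "location", "population"]
]

# Total population of each continent, largest first.
continent_population_df = (
    locations_df.groupby("continent", as_index=False)["population"]
    .sum()
    .sort_values("population", ascending=False)
    .reset_index(drop=True)
)

# Top 10 locations by population within one continent.
top_africa = locations_df[locations_df["continent"] == "Africa"].nlargest(
    10, "population"
)

print(continent_population_df)
print(top_africa)
```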

Question#3

Show the mean and max of total_cases, total_deaths, new_cases, total_tests and total_vaccinations for countries in Asia, Europe and North America

Question#4

Let's see the speed of transmission of the coronavirus between countries on the map.

Worldwide spread

Coronavirus is continuing its spread across the world, with almost 100 million confirmed cases in 191 countries and more than two million deaths. The virus has been detected in nearly every country, as these maps show.

We can see the trend of COVID-19 moving from China -> Europe -> US.

You can click each country and see the number representing the spread of the virus.

The same China -> Europe -> US trend is visible on the map.

Question#5

Let's see the number of total_cases, total_deaths, total_deaths_per_million and tests per confirmed (%) on the map.

COVID-19 maps

Let's see the number of confirmed cases on the map.

For the African regions, the number of confirmed cases is lower than on other continents. I guess this is because the number of tests is quite low.

You can click each country and see the number of total confirmed cases.

We can see that the US, Brazil and India stand out.

Let's see the number of deaths on the map.

You can click each country and see the number of total deaths.

We can see that the US, Brazil, Mexico and India stand out.

Let's see the number of total deaths per million on the map.

You can click each country and see the number of total deaths per million.

We can see that South America, North America and Europe have the highest total deaths per million.

Question#5

Top 15 countries by total_cases, total_deaths, total_deaths_per_million, total_tests, people_fully_vaccinated and total_vaccinations, shown with plot_hbar and treemaps

Top 15 countries

Visualizing Treemaps

We used this data visualization technique to display hierarchical data as nested rectangles and to display multiple elements together accurately.

Line Plot function

We used this data visualization technique to plot the day-by-day trend as lines and to display multiple elements together accurately.

Question#6

How many new deaths (smoothed) were there day by day in each continent?

Question#7

How many new vaccinations (smoothed) were there day by day in each continent?

Question#8

How many new tests (smoothed) were there day by day in each continent?

Question#8

What was the positive_rate day by day in each continent?

Test population coverage

Question#9

Find clusters of gdp_per_capita and new_cases across countries

Question#10

Find clusters of new_deaths_smoothed_per_million, handwashing_facilities and extreme_poverty across countries

Question#11

Find clusters of new_deaths_smoothed_per_million, aged_70_older and population_density across countries

Question#12

Find clusters of new_deaths_smoothed_per_million, life_expectancy and hospital_beds_per_thousand across countries

Stringency Index and death rate correlation

Modeling

Correlation Analysis

Linear Regression-Forecast

I created a linear regression model, fit it to the OWID COVID-19 data, and predicted the worldwide death projection for the next 30 days. I used sklearn to build the Linear Regression model with an 80/20 train/test split, then trained the model and predicted deaths for the next 30 days. I also created a model using XGBoost to improve on the linear regression model, fit it to the same OWID COVID-19 data, and predicted the worldwide death projection for the next 30 days.
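The workflow described here (80/20 split, fit, 30-day projection) could be sketched as follows. This is not the notebook's actual code. The data is a synthetic, roughly linear series standing in for cumulative world deaths, the single feature is the day number, and holding out the last 20% of days without shuffling is an assumption made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cumulative deaths: linear trend plus noise.
rng = np.random.default_rng(0)
days = np.arange(100).reshape(-1, 1)  # day number as the only feature
deaths = 50.0 * days.ravel() + 1000.0 + rng.normal(0, 20, size=100)

# 80/20 split; shuffle=False keeps the last 20 days as the test set (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    days, deaths, test_size=0.2, shuffle=False
)

model = LinearRegression()
model.fit(X_train, y_train)

# LinearRegression.score returns the R^2 coefficient on the given data.
test_score = model.score(X_test, y_test)

# Project the next 30 days beyond the observed range.
future_days = np.arange(100, 130).reshape(-1, 1)
forecast = model.predict(future_days)

print(round(test_score, 4))
print(forecast[:3])
```

A gradient-boosted model such as XGBoost would slot into the same fit and predict steps in place of LinearRegression; it is left out here to keep the sketch dependent on sklearn only.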

k-nearest neighbors (KNN) algorithm

Can I create a model that predicts the risk level for the Case Mortality Ratio of a country using its life expectancy, the percentage of its population over 65, its diabetes_prevalence, and its cardiovasc_death_rate?

I decided to use population over age 65, diabetes_prevalence and cardiovasc_death_rate because, worldwide, over 80% of the deaths were in the population aged 65 and over, and the CDC has stated that 94% of deaths involved some underlying health condition. I also used life expectancy per country to account for possible deficiencies in the health care system. Johns Hopkins University has listed several diseases, such as heart disease and diabetes, which are known to be exacerbated by obesity. The idea is that we can predict the Mortality Ratio of COVID-19 more accurately by using both the population aged 65 and over and obesity, rather than the population aged 65 and over alone. This may show that creating a healthier population is the best way to prevent, in future pandemics, the devastation that the world is currently facing.

After viewing the graphs in the Linear Regression-Forecast section, we saw the accuracy that the XGBoost algorithm can achieve with this data. We will continue and see whether our ML algorithm can do better than we expect. We initially chose to use classification with the HighRisk category as the target, as that may be more accurate than regression. Or can we use more precise algorithms to build a learning model appropriate to the data?

Correlation Analysis

We will use diabetes prevalence, cardiovascular health, the percentage of the population above 70, and any other data we find most useful, to see if we can get better results with these features.

Now we will split our data for the machine learning algorithm, using the High Risk category as our target and Life_Expectancy, icu_patients, diabetes_prevalence, and aged_65_older as features.
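A sketch of how the target and the split could be set up is shown below. It reads the High Risk definition given in the conclusions (deaths per million greater than 0.65 standard deviations) as "more than 0.65 standard deviations above the mean", which is an interpretation. The frame country_df, its made-up values, the column total_deaths_per_million as the basis of the label, and the 80/20 split ratio are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One made-up row per country.
country_df = pd.DataFrame({
    "location": [f"Country {i}" for i in range(10)],
    "total_deaths_per_million": [5, 10, 20, 30, 40, 50, 60, 80, 900, 1200],
    "Life_Expectancy": [60, 62, 65, 68, 70, 72, 75, 78, 81, 83],
    "icu_patients": [0, 0, 5, 10, 12, 20, 30, 45, 400, 900],
    "diabetes_prevalence": [3.1, 4.0, 5.2, 6.0, 6.5, 7.1, 8.0, 8.4, 9.9, 10.8],
    "aged_65_older": [3.0, 3.5, 5.0, 6.1, 8.0, 10.2, 12.5, 15.0, 19.8, 21.0],
})

# High Risk: deaths per million above mean + 0.65 standard deviations (assumed reading).
deaths = country_df["total_deaths_per_million"]
threshold = deaths.mean() + 0.65 * deaths.std()
country_df["HighRisk"] = deaths > threshold

feature_cols = ["Life_Expectancy", "icu_patients", "diabetes_prevalence", "aged_65_older"]
X = country_df[feature_cols]
y = country_df["HighRisk"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(round(threshold, 1), int(y.sum()))
```

Because only a minority of countries fall above the threshold, the two classes are unbalanced, which matters when accuracy is later used as the metric.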

Despite our initial reservations, KNN was able to get a decent accuracy of 90.19%.

Let's test which k value gets us our best accuracy.

Interestingly, we get a slightly better classification accuracy of 90.518% with k = 7.
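A search over k like the one described can be sketched with sklearn's KNeighborsClassifier. The data below is synthetic (generated with make_classification), so the accuracies it prints have no relation to the figures reported above; the range 1 to 7 matches the k values mentioned in the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 4 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit one classifier per k and record its accuracy on the held-out split.
accuracy_by_k = {}
for k in range(1, 8):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_by_k[k] = knn.score(X_test, y_test)

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k)
print("best k:", best_k)
```

Picking k on the same split that is used to report accuracy tends to give a slightly optimistic number; a separate validation split or cross-validation would be a more careful variant.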

A further look at our predictions and Y_test values shows that we would get 90.226% simply by predicting almost everything as False, so this model's features and data should be improved.
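The check described here, comparing the model against always predicting the majority class, can be sketched in plain Python. The labels below are made up (nine False, one True) and the helper names are chosen for this example.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_label, count = Counter(y_true).most_common(1)[0]
    return most_common_label, count / len(y_true)


def recall_for_label(y_true, y_pred, label):
    """Share of the true `label` cases that the predictions actually found."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


# Made-up test labels: 9 low-risk countries, 1 high-risk country.
y_test = [False] * 9 + [True]
# A model that predicts False for everything.
y_pred = [False] * 10

baseline_label, baseline_accuracy = majority_baseline_accuracy(y_test)
model_accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)
high_risk_recall = recall_for_label(y_test, y_pred, True)

print(baseline_label, baseline_accuracy, model_accuracy, high_risk_recall)
```

When the model's accuracy is no higher than the baseline and its recall on the High Risk class is near zero, the headline accuracy says little about whether high-risk countries are being identified.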

Now we will test all the features we currently have and use a greedy algorithm to select the best features, using KNN and all k values between 1 and 7.
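The source does not show its selection code, so the following is one plausible form of such a greedy search: forward selection that adds, at each round, the feature giving the largest accuracy gain, trying every k from 1 to 7. The function names, the synthetic data and the cap of four features are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def best_knn_score(X_train, X_test, y_train, y_test, columns, k_values):
    """Return (accuracy, k) for the best k using only the given columns."""
    best = (0.0, None)
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train[:, columns], y_train)
        score = knn.score(X_test[:, columns], y_test)
        if score > best[0]:
            best = (score, k)
    return best


def greedy_forward_selection(X_train, X_test, y_train, y_test, k_values, max_features):
    """Add one feature at a time, keeping the one that improves accuracy most."""
    selected, best_score, best_k = [], 0.0, None
    remaining = list(range(X_train.shape[1]))
    while remaining and len(selected) < max_features:
        trials = [
            (best_knn_score(X_train, X_test, y_train, y_test, selected + [col], k_values), col)
            for col in remaining
        ]
        (score, k), col = max(trials, key=lambda trial: trial[0][0])
        if score <= best_score:
            break  # no remaining feature improves the accuracy
        selected.append(col)
        remaining.remove(col)
        best_score, best_k = score, k
    return selected, best_score, best_k


# Synthetic stand-in data with 8 candidate features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=2, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

selected, best_score, best_k = greedy_forward_selection(
    X_train, X_test, y_train, y_test, k_values=range(1, 8), max_features=4
)
print(selected, round(best_score, 3), best_k)
```

A greedy search evaluates far fewer combinations than an exhaustive one, at the cost of possibly missing a feature set whose members only help in combination.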

Looking at the data above, we got a slightly better accuracy using k=7 with the following 4 features:

'population_density', 'icu_patients', 'female_smokers', 'human_development_index'

The model using the extra features, especially human_development_index, the smoker data and more recent target data, has gotten better at predicting a country's rate of mortality vs. population, going from 90% to 93% accuracy depending on the randomization.

This may be due to different reporting systems for what is and is not a covid death and overall accuracy of the inputs.

Our original hypotheses that age and obesity would be factors seem to be supported by the data. One might even try regression on the normalized mortality / population; if we had had the Our World in Data dataset originally, we might have gone further and tried that, as the correlation seems to be stronger.

Inferences and Conclusion

Based on the analysis, there are several things we can conclude.

1- The United States and India are the countries most affected by COVID-19, with a very high number of cases and deaths. Nevertheless, they are also first in the world in COVID-19 vaccinations.

2- Using a High Risk classification target, defined as those countries whose COVID-19 deaths per million population were more than 0.65 standard deviations above the mean (about 25% of the countries), we evaluated over 12 features provided in the dataset using kNN and a range of k values. The four features population_density, icu_patients, female_smokers and human_development_index gave an accuracy of about 93% using kNN with k=7.

3- The Linear Regression-Forecast model reached an accuracy score of 0.9976567640125802 when fitted to predict the death rate for the next 30 days, with independent variables such as age, GDP, diabetes, smokers, hospital beds, etc.

4- Although for most people COVID-19 causes only mild illness, it can make some people very ill. More rarely, the disease can be fatal. Older people, and those with pre-existing medical conditions (such as high blood pressure, heart problems or diabetes), appear to be more vulnerable.

5- In the dataset, the Government Response Stringency Index is a composite measure based on nine response indicators, including school closures, workplace closures, and travel bans, rescaled to a value from 0 to 100 (100 = strictest response).

6- We were able to create several visualizations in the Jupyter Notebook, with scatterplots comparing the different features to the High Risk category that we found to produce the best model. Age and obesity do seem to be factors in the mortality rate, with extra features such as smoking and cardiovascular disease helping to improve the numbers further.

7- Commitment to virus prevention measures is important, especially in densely populated cities. Differences in population size between countries are often large, and the COVID-19 death count in more populous countries tends to be higher. Because of this, it can be insightful to know how the number of confirmed deaths in a country compares to the number of people who live there, especially when comparing across countries.

8- The testing phase of K-nearest neighbor classification is slow and costly in terms of time and memory. It requires a large amount of memory to store the entire training dataset for prediction. KNN requires scaling of the data because it uses the Euclidean distance between two data points to find the nearest neighbors. Euclidean distance is sensitive to magnitudes: features with high magnitudes will weigh more than features with low magnitudes. KNN is also not suitable for high-dimensional data.
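The scaling point in item 8 can be illustrated with a small plain-Python sketch. The rows are made-up (population_density, aged_65_older) pairs, min-max scaling is one common choice among several, and the helper names are invented for this example.

```python
import math


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


def nearest_index(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index]."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))


# Made-up (population_density, aged_65_older) values for four countries.
rows = [
    (100.0, 5.0),    # query country
    (110.0, 20.0),   # similar density, very different age profile
    (400.0, 6.0),    # different density, similar age profile
    (8000.0, 12.0),  # very dense outlier that sets the density range
]

raw_nearest = nearest_index(rows, 0)
scaled_nearest = nearest_index(min_max_scale(rows), 0)
print(raw_nearest, scaled_nearest)
```

On the raw values the density column decides which neighbor is closest simply because its numbers are larger; after scaling, both features contribute on the same footing and the nearest neighbor changes.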

Recommendation and Future Improvements

1- Assuming the published data are reliable, the SIR model can be applied to assess the spread of COVID-19 and to predict the numbers of infected, removed and recovered individuals and deaths in communities, while accommodating possible surges in the number of susceptible individuals.

2- Preferably exclude the vaccine columns when analyzing.

3- Collect separate vaccine data, analyze it, and create a prediction model for daily and monthly vaccinations.
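As a pointer for the first recommendation, the basic SIR model tracks susceptible (S), infected (I) and removed (R) counts with dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I and dR/dt = gamma*I. The sketch below integrates these with a simple forward-Euler step. The parameter values are made up rather than fitted to any data, and this basic form lumps recoveries and deaths into R and does not model surges in the susceptible population.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Integrate the basic SIR equations with a forward-Euler scheme."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))  # one snapshot per simulated day
    return history


# Made-up parameters: beta = contact rate, gamma = removal rate (per day).
history = simulate_sir(
    population=1_000_000, initial_infected=10, beta=0.3, gamma=0.1, days=160
)

peak_day = max(range(len(history)), key=lambda day: history[day][1])
final_s, final_i, final_r = history[-1]
print("peak infections on day", peak_day)
print("removed at the end:", round(final_r))
```

Fitting beta and gamma to the reported case counts of a given country would be the step that turns this sketch into a usable projection.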

References and Future Work

1- https://www.programcreek.com/python/example/81623/sklearn.metrics.classification_report

2- https://www.python-course.eu/python3_class_and_instance_attributes.php

3- https://thispointer.com/data-analysis-in-python-using-pandas/

4- https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas

5- https://ourworldindata.org/coronavirus

6- https://covid19.moh.gov.sa/

7- https://github.com/

8- https://www.kaggle.com/